Sketch-Based Estimation of Subpopulation-Weight
Abstract
Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records’ attributes. Bottom-k sketches are a powerful summarization format for weighted items that includes priority sampling [18] (pri) and the classic weighted sampling without replacement (ws). They can be computed efficiently for many representations of the data, including distributed databases and data streams. We derive novel unbiased estimators and efficient confidence bounds for subpopulation weight. Our estimators and bounds are tailored by distinguishing between applications (such as data streams) where the total weight of the sketched set can be computed by the summarization algorithm without a significant use of additional resources, and applications (such as sketches of network neighborhoods) where this is not the case. Our rank conditioning (RC) estimator is applicable when the total weight is not provided. This estimator generalizes the known estimator for pri sketches [18], and its derivation is simpler. When the total weight is available, we suggest another estimator, the subset conditioning (SC) estimator, which is tighter. Our rigorous derivations, based on clever applications of the Horvitz-Thompson estimator (which is not directly applicable to bottom-k sketches), are complemented by efficient computational methods. A performance evaluation using a range of Pareto weight distributions demonstrates considerable benefits of the ws SC estimator on larger subpopulations (over all other estimators); of the ws RC estimator (over existing estimators for this basic sampling method); and of our confidence bounds (over all previous approaches). Overall, we significantly advance the state of the art in estimating subpopulation weight queries.
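To make the rank conditioning idea concrete, here is a minimal sketch in Python. The abstract gives no formulas, so the details are assumptions based on the standard definitions of these sampling schemes: pri ranks are u/w with u uniform on (0, 1], ws ranks are exponential with rate w, the sketch keeps the k smallest ranks, and each sampled item gets the adjusted weight w / P(rank < tau), where tau is the (k+1)-th smallest rank. The function names are hypothetical, and the SC estimator and the confidence bounds are not covered.

```python
import math
import random


def bottom_k_sketch(items, k, scheme="pri", rng=None):
    """Build a bottom-k sketch of (key, weight) pairs with positive weights.

    Returns (sample, tau): the k items of smallest rank and the (k+1)-th
    smallest rank (infinity when there are at most k items).
    """
    rng = rng or random.Random()
    ranked = []
    for key, w in items:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        if scheme == "pri":
            rank = u / w  # priority sampling rank (assumed form)
        elif scheme == "ws":
            rank = -math.log(u) / w  # exponential rank with rate w (assumed form)
        else:
            raise ValueError("scheme must be 'pri' or 'ws'")
        ranked.append((rank, key, w))
    ranked.sort(key=lambda t: t[0])
    sample = [(key, w) for _, key, w in ranked[:k]]
    tau = ranked[k][0] if len(ranked) > k else math.inf
    return sample, tau


def rc_adjusted_weight(w, tau, scheme="pri"):
    """Adjusted weight w / P(rank < tau), conditioned on the threshold tau."""
    if math.isinf(tau):
        return w  # every item was kept, so no scaling is needed
    if scheme == "pri":
        p = min(1.0, w * tau)
    else:
        p = -math.expm1(-w * tau)  # 1 - exp(-w * tau)
    return w / p


def estimate_subpopulation_weight(sample, tau, predicate, scheme="pri"):
    """Sum the adjusted weights of sampled items whose key satisfies predicate."""
    return sum(
        rc_adjusted_weight(w, tau, scheme) for key, w in sample if predicate(key)
    )
```

For example, `sample, tau = bottom_k_sketch(items, 5, "ws")` followed by `estimate_subpopulation_weight(sample, tau, lambda key: key % 2 == 0, "ws")` estimates the weight of the even-keyed items. Note that the total weight of the data set is never used, which matches the setting the abstract describes for the RC estimator.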
Similar resources
Estimation of Subpopulation Parameters in One-stage Cluster Sampling Design
Sometimes, in order to estimate population parameters such as mean and total values, we extract a random sample by the cluster sampling method, and after completing sampling, we are interested in using the same sample to estimate the desired parameters in a subset of the population, which is called a subpopulation. In this paper, we try to estimate subpopulation parameters in different cases when one-st...
Finding Heavily-Weighted Features with the Weight-Median Sketch
We introduce the Weight-Median Sketch, a sub-linear space data structure that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual infor...
CoMMEDIA: Separating Scaramouche from Harlequin to Accurately Estimate Items Frequency in Distributed Data Streams
In this paper, we investigate the problem of estimating the number of times that data items recur in very large distributed data streams. We present an alternative approach to the well-known CountMin Sketch in order to reduce the impact of collisions on the accuracy of the estimation. We propose to decrease, for each concerned item, the over-estimation that results from these collisions. Our sk...
A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation
Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve the accuracy of traditional user-based collaborative filtering techniques under the new-user cold-start problem a...
Offline Sketch Parsing via Shapeness Estimation
In this work, we target the problem of offline sketch parsing, in which the temporal order of strokes is unavailable. It is more challenging than most existing work, which usually leverages the temporal information to reduce the search space. Different from traditional approaches in which thousands of candidate groups are selected for recognition, we propose the idea of shapeness estima...
Journal title:
- CoRR
Volume: abs/0802.3448
Publication date: 2008